[torch nightlies] Use main Dockerfile with flags for nightly torch tests#244
@khluu I might need your help on this one and/or have you point me to an expert on Buildkite configs. I'm trying to use the standard Docker builds here for PyTorch nightly testing, but need to also run Current status is that the main Docker image is used (which is good), but tests are all running on release PyTorch versions (not good) without the latest changes. Latest failing run is at https://buildkite.com/vllm/ci/builds/42927/steps/canvas?sid=019b0a30-bee1-4b6b-8393-7f85b537d2ef with the error because of e596c0d#diff-b5c060fa4acd68fd48a2b3cdcd4069bd9eae5b0ee8512e1b25d8f8e2526834e5R480 Any thoughts? cc @atalman as well and I'll keep digging.
I think the |
Good call on needing build as well as test signal. Let me see what I can do to modify the base Dockerfile.
Code changes should be ready for review after the final Buildkite test runs. Update: those runs are now done.
timeout_in_minutes: 600
commands:
  - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
  - "aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com"
Note that this matches the standard Dockerfile build flags both here and further down. The key point is that vLLM builds with PT nightlies and standard vLLM builds should be identical here except for the --build-arg PYTORCH_NIGHTLY=1 flag. Unfortunately we can't unify further yet, but we can do that in some additional commits.
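To make the "identical minus one flag" idea concrete, here is a minimal sketch of how the two build commands could be assembled. It only prints the commands rather than running them. The `build_cmd` helper, the `docker/Dockerfile` path, and the `vllm-ci:*` tags are hypothetical placeholders for illustration, not values taken from this PR; only `--build-arg PYTORCH_NIGHTLY=1` comes from the change itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the docker build command for a given mode.
# The Dockerfile path and image tag are assumed placeholders.
build_cmd() {
  local mode="$1"
  local -a args=(docker build --file docker/Dockerfile --tag "vllm-ci:${mode}")
  if [[ "${mode}" == "nightly" ]]; then
    # Intended to be the only difference from the standard build.
    args+=(--build-arg PYTORCH_NIGHTLY=1)
  fi
  args+=(.)
  echo "${args[*]}"
}

build_cmd standard
build_cmd nightly
```

Keeping both modes behind one code path is one way to make any drift between the standard and nightly builds show up as a visible diff in a single place.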
For the main Nvidia GPU build this has now moved to buildkite/scripts/ci-bake.sh, but we should still keep these incremental changes towards what the main build was doing.
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
@simon-mo and @khluu thank you for merging vllm-project/vllm#30443! This is the other part of it. After this one we can land vllm-project/vllm#32426.
Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.
Moving this from #239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo
Tests to confirm:
- (`vllm` fork matching HEAD, no `ci-infra` changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. -> Seems like the PT nightlies build itself failed on installing `flashinfer`, so all tests failed afterwards.
- (`vllm` changes at vllm-project/vllm#30443, my `ci-infra` changes at #244) with a successful build at https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-43ce-46d3-99c2-c10a1a8ce96c. One downstream test is failing, but that looks real and something we will investigate.

We will remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch in a separate commit.
Looking for review and landing with help from @khluu, @amrmahdi, @atalman, @huydhn. Thanks!